Learnable Crawling: An Efficient Approach to Topic-specific Web Resource Discovery
Abstract
The rapid growth of the Internet makes it difficult to find information in such a large network of databases. Topic-specific web crawlers have become one way to find the needed information. The main characteristic of a topic-specific web crawler is that it selects and retrieves only relevant web pages in each crawl. Much previous research has focused on topic-specific web crawling. However, none of it has addressed how the crawler should behave in subsequent crawls. In this paper, we present an algorithm that covers the details of both the first crawl and the subsequent crawls. To make the next crawl efficient, we keep a log of the previous crawl and use it to build three knowledge bases: seed URLs, topic keywords, and URL prediction. These knowledge bases give the topic-specific web crawler experience from which it can produce the results of the next crawl more efficiently.
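The abstract does not say how the three knowledge bases are derived from the log, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. The log schema (`CrawlRecord`), the relevance threshold, and the per-host relevance ratio used as a stand-in for "URL prediction" are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class CrawlRecord:
    """One entry of the previous crawl's log (hypothetical schema)."""
    url: str
    relevance: float   # assumed relevance score in [0, 1]
    terms: List[str]   # assumed pre-tokenized terms of the page


def build_knowledge_bases(log, threshold=0.5, n_seeds=2, n_keywords=3):
    """Derive seed URLs, topic keywords and a URL predictor from a crawl log."""
    relevant = [r for r in log if r.relevance >= threshold]

    # Seed URLs: the highest-scoring relevant pages of the previous crawl.
    ranked = sorted(relevant, key=lambda r: r.relevance, reverse=True)
    seed_urls = [r.url for r in ranked[:n_seeds]]

    # Topic keywords: the most frequent terms found on relevant pages.
    counts = Counter(t for r in relevant for t in r.terms)
    topic_keywords = [t for t, _ in counts.most_common(n_keywords)]

    # URL prediction (assumed form): per-host fraction of relevant pages,
    # usable as a prior when ordering unseen URLs from the same host.
    seen = defaultdict(int)
    hits = defaultdict(int)
    for r in log:
        host = urlparse(r.url).netloc
        seen[host] += 1
        hits[host] += int(r.relevance >= threshold)
    url_prediction = {h: hits[h] / seen[h] for h in seen}

    return seed_urls, topic_keywords, url_prediction
```

In a next crawl, the returned seed URLs would initialize the frontier, the keywords would feed the relevance test, and the per-host ratios could be used to rank newly discovered links before they are fetched.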
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Learnable Topic-specific Web Crawler
A topic-specific web crawler collects web pages relevant to topics of interest from the Internet. Much previous research has focused on algorithms for web page crawling. The main purpose of those algorithms is to gather as many relevant web pages as possible, and most of them detail only the approach for the first crawl. However, no one has ever mentioned some important questions, such...
Expert Discovery: A web mining approach
Expert discovery is the search for an answer to a question: “Who is the best expert of a specific subject in a particular domain within peculiar array of parameters?” An expert with domain knowledge in a field is crucial for consulting in industry, academia and the scientific community. The aim of this study is to address the issues of the expert-finding task in real-world communities. Collabor...
Topical web crawling for domain-specific resource discovery enhanced by selectively using link-context
In topical web crawling, link-context is the critical contextual information around anchor text used to retrieve domain-specific resources. However, some link-contexts may misguide topical web crawling and extract wrong web pages, because several relevant anchor texts become irrelevant or several irrelevant anchor texts become relevant after calculating the relevance between the link-contexts and...
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplar...